Intel C/C++ Compiler Plug-in

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

*Third-party brands and names are the property of their respective owners.

1.0. Overview of the Intel C/C++ Compiler Plug-in

The Intel C/C++ Compiler Plug-in version 2.4 integrates into the Microsoft Developer Studio environment and lets users write inline assembly with Pentium® Pro and Pentium® II processor instructions, which the latest version of Microsoft Visual C++, version 5.0, does not currently support. The Intel C/C++ Compiler Plug-in is fully compatible with the Microsoft Visual C++ 4.x or later compilers in the following areas: command line switches, inline assembly format, object module, library and DLL formats, and debug and C++ symbol formats.

The Intel C/C++ Compiler Plug-in provides additional optimizations that are not currently available with the Microsoft Visual C++ Compiler. For example, the compiler provides a rounding control option which optimizes floating point to integer conversions. The compiler also supports the Pentium Pro and Pentium II processor specific instructions. This application note describes these and other features of the Intel C/C++ Compiler Plug-in that are useful in optimizing applications. The first few sections explain how to install and set up the compiler, and the later sections present optimization techniques and an analysis of their performance.

2.0. Installing the Intel C/C++ Compiler Plug-in

2.1. Hardware Requirements

At least an Intel486™ processor based system
At least 16 MB of RAM
At least 100 MB of hard disk space

2.2. Software Requirements

Microsoft* Windows NT or Windows 95
Microsoft Visual C++ 4.x or later
Microsoft Macro Assembler (MASM) version 6.11d (an assembler is needed only when using the -Quse_asm option, which compiles source files into an assembler file and then uses a specified assembler to generate the object file)

2.3. Installation Procedure

The Intel C/C++ Compiler Plug-in installs directly into the Microsoft Visual C++ version 4.x or 5.0 environment. The installation program installs the Intel Compiler Selection Tool and makes the compiler accessible from the Developer Studio tools menu.

To compile a program using the Intel C/C++ Compiler Plug-in, choose the Select Compiler option located in the tools menu. A window will appear (figure 2.3.) which will allow the user to toggle between the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ Compiler.

Figure 2.3. Select Compiler Window

When switching between the two compilers, select the Rebuild All option under the Build menu to ensure that the application is rebuilt with the newly selected compiler.

3.0. Benefits of the Intel C/C++ Compiler Plug-in

3.1 Some Useful Optimization Switches

In table 3.1, some useful optimization switches are given with a brief description. The full listing of the available switches can be found in the "Intel C/C++ Compiler Plug-in User's Guide for Win32 Systems" which comes with the compiler. The following sections of this application note will cover how to use some of the optimization switches listed below and the performance gain that could be obtained when using these switches.

Optimization Switch	Description of When to Use the Optimization Switch
-GB or -G3	Used by default. Use this compiler switch when the application needs to run on a wide range of Intel processors
-G4	Use to optimize code exclusively for the Intel486 processor
-G5	Use to optimize code exclusively for the Pentium processor
-G6	Use to optimize code exclusively for the Pentium Pro and Pentium II processors.
-Qxi	Allows the use and generation of Pentium Pro specific instructions
-Qmem	Use for memory optimizations to improve cache accesses and reduce memory accesses
-Qprec	Use to improve the floating-point precision
-Qrcd	Use to improve floating-point to integer conversions, by disabling the floating point rounding control

Table 3.1 Some Useful Optimization Switches

3.2. Support for Pentium Pro and Pentium II Processor Specific Instructions

Using the Intel C/C++ Compiler Plug-in allows Pentium Pro and Pentium II processor specific instructions to be used. These instructions (CMOVcc, FCMOVcc, FCOMI, RDPMC, and UD2) are not currently supported within Microsoft Visual C++ version 5.0. The CMOVcc and FCMOVcc instructions are particularly useful on the Pentium Pro and Pentium II processors because they can improve the performance of applications that contain many conditional branches. Replacing branches with CMOVcc and FCMOVcc instructions reduces the number of branches that can be mispredicted, which should improve the overall performance of the application.

The CMOVcc and FCMOVcc instructions are conditional move instructions. These instructions check the state of one or more of the status flags and perform a move operation if the flags are in a specified state. Using the CMOVcc instruction is beneficial because it does not require a branch which could be mispredicted. For example:

	CMP EAX,EBX	;compare and set flags 
	CMOVGE	EAX,EBX	;if eax >= ebx then set eax=ebx otherwise no change

The following code would need to be used if the CMOVcc instruction is not supported:

	CMP EAX,EBX	;compare the value in eax with the value in ebx
	JL NOTGE	;if !(eax >= ebx) jump over move instruction
	MOV EAX,EBX	;set eax = ebx because eax >= ebx
NOTGE:			
	...		;additional code

Code Example 3.2. Comparing code for CMOVcc
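
The two assembly sequences above can also be compared in C++. The following is a minimal sketch, not code taken from the compiler: the function names select_branch and select_branchless are hypothetical, and the mask technique is only an illustration of how a result can be selected without a jump, which is what CMOVcc does in a single instruction. Both functions leave the first value unchanged unless it is greater than or equal to the second, in which case the second value is returned.

```cpp
// Branching form: mirrors the CMP / JL / MOV sequence.
int select_branch(int eax, int ebx)
{
    if (eax >= ebx) {
        eax = ebx;              // taken only when eax >= ebx
    }
    return eax;
}

// Branch-free form: mirrors the effect of CMP / CMOVGE.
// The mask is all ones when eax >= ebx and zero otherwise.
int select_branchless(int eax, int ebx)
{
    int mask = -static_cast<int>(eax >= ebx);
    return (ebx & mask) | (eax & ~mask);
}
```

An optimizing compiler may translate either form into the same instructions, so the sketch illustrates the idea rather than guaranteeing a particular code sequence.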

The following sections discuss how the Intel C/C++ Compiler Plug-in supports the Pentium Pro and Pentium II processor specific instructions. Examples are given using the CMOVcc instruction, but the examples can easily be applied to the other specific instructions.

3.2.1. How to Use the Instructions as Inline Assembly Code

The Intel C/C++ Compiler Plug-in allows Pentium Pro and Pentium II processor specific instructions to be used as inline assembly instructions. This is not currently supported by the Microsoft Visual C++ version 4.x or 5.0 compilers. The Pentium Pro and Pentium II processor specific instructions can be used as inline assembly instructions simply by using the _asm directive (code example 3.2.1.).

	int func(int x, int y)
	  {
	  int I;
	  _asm
	    {
	    mov     eax,x
	    mov     ebx,y
	    cmp     eax,ebx     ;compare x with y to set the flags
	    cmovge  eax,ebx     ;if x >= y then set eax = y
	    mov     I,eax       ;store the result
	    }
	  return I;
	  }

Code Example 3.2.1. Using CMOVcc with _asm Directive
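
Because inline assembly ties the source to one compiler and one processor family, it can be useful to keep a plain C++ fallback next to it. The following is a sketch under stated assumptions: the function name min_cmov is hypothetical, and the guard relies on the _MSC_VER and _M_IX86 macros that Microsoft-compatible compilers predefine when targeting 32-bit x86. On any other compiler or target the conditional expression is used instead.

```cpp
// Returns a when a < b, otherwise b.
int min_cmov(int a, int b)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    int result;
    __asm
    {
        mov    eax, a
        mov    ecx, b
        cmp    eax, ecx      ; set the flags from a - b
        cmovge eax, ecx      ; if a >= b then eax = b
        mov    result, eax
    }
    return result;
#else
    // Portable fallback with the same result.
    return (a >= b) ? b : a;
#endif
}
```

The inline assembly path requires a processor that implements CMOVcc (Pentium Pro or later), so code that must also run on earlier processors needs a run-time check before calling it.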

3.2.2. How to Use the Compiler to Generate the Instructions

The Intel C/C++ Compiler Plug-in not only allows the user to write the Pentium Pro and Pentium II processor specific instructions as inline assembly instructions, but will also generate these instructions itself to optimize C code. This optimization occurs when certain compiler switches are set. These switches notify the compiler that it should generate assembly code specifically for the Pentium Pro and Pentium II processors. The compiler switches that need to be set in order to generate Pentium Pro and Pentium II processor specific assembly instructions are -G6 and -Qxi. To set these switches, select Project from the menu and choose the Settings option. Select the C/C++ tab and type the switches into the project options box. Figure 3.2.2 shows the Project Settings dialog box with the specified project settings.

Figure 3.2.2 Project Settings using the Pentium Pro Processor's Optimization Switches

The Maximize Speed optimization option must also be set to produce the fastest code.

To view the assembly output generated by the compiler, select Project from the menu and select the Settings option. A project settings dialog box will appear. Select the C/C++ tab. Under the category drop down menu select Listing Files and specify the Listing File Type and the Listing File Name. Figure 3.2.3 shows the project settings dialog box with a listing file type of Assembly-Only Listing and an output file location to place the assembly code listing.

Figure 3.2.3 Project Settings for Creating an Assembly Listing File

3.2.3 Example Code

A simple example shows the benefit of the -G6 and -Qxi Pentium Pro processor optimizations. The next sections discuss the assembly code generated by both the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ 5.0 Compiler. The performance analysis is based on the assembly code listings produced by the compilers and on cycle counts measured with the RDTSC and CPUID instructions. The RDTSC instruction reads the current cycle count, and the CPUID instruction serializes instruction execution. The CPUID instruction is necessary because both the Pentium Pro and Pentium II processors execute instructions out of order, so without it the RDTSC instruction could execute before earlier instructions have completed. The code used to demonstrate the generation of the CMOVcc instruction is given below:

	int time_left[32];
	//Loops 32 times, updating the time left. The algorithm used is as follows:
	//if (time_to_waste < t)
	//then set time_left = time_left - time_to_waste
	//else set time_left = time_left - t
	void timeloop(int time_to_waste)
	  {
	  for(int i=0; i<32; i++)
	    {
	    int t = time_left[i];
	    time_left[i] -= time_to_waste < t ? time_to_waste : t;
	    }
	  }

Code Example 3.2.3. Generation of CMOVcc
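
To make the effect of the loop concrete, the sketch below applies the same update to a caller-supplied array. It is an illustration only: the function name timeloop_n and its pointer and count parameters are hypothetical additions so that the loop can be exercised on a small array instead of the global 32-element array.

```cpp
#include <cstddef>

// Subtracts the smaller of time_to_waste and the current entry from each entry.
// An entry larger than time_to_waste is reduced by time_to_waste;
// any other entry becomes zero.
void timeloop_n(int *time_left, std::size_t count, int time_to_waste)
{
    for (std::size_t i = 0; i < count; i++) {
        int t = time_left[i];
        time_left[i] -= (time_to_waste < t) ? time_to_waste : t;
    }
}
```

For example, calling timeloop_n on the array {10, 32, 40, 0} with time_to_waste set to 32 leaves {0, 0, 8, 0}: only the third entry is larger than 32, so only that entry keeps a remainder.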

Using the Microsoft Visual C++ Compiler

Code example 3.2.3.1 shows the assembly code generated by the Microsoft Visual C++ Compiler version 5.0.

	push	esi				;Store the current value of esi	
	mov	esi, DWORD PTR _time_to_waste$[esp]	;esi  = time_to_waste
	mov	ecx, OFFSET FLAT:?time_left@@3PAHA	;ecx  = time_left
$L169:
	mov	eax, DWORD PTR [ecx]	;eax = t = time_left[i]
	cmp	esi, eax		;compare time_to_waste to t
	mov	edx, esi		;edx = time_to_waste
	jl	SHORT $L180		;if time_to_waste < t jump 
	mov	edx, eax		;if time_to_waste !< t set edx = t
$L180:
	sub	eax, edx		;eax = time_left[i]-(either t or time_to_waste)
	mov	DWORD PTR [ecx], eax	;store value in array time_left[i]
	add	ecx, 4			;increment to next array value	
					;Have we reached the end of the array
	cmp	ecx, OFFSET FLAT:?time_left@@3PAHA+128
	jl	SHORT $L169		;keep looping until the end of the array is reached
	pop	esi			;restore esi value
	ret	0			;return
	
Code Example 3.2.3.1 Assembly Code Generated from the Visual C++ Compiler

Using Intel C/C++ Compiler Plug-in

Code example 3.2.3.2 shows the assembly code generated by the Intel C/C++ Compiler Plug-in. Notice the use of the cmovle instruction.

	push ebx			;store the current value of ebx
	mov ecx, DWORD PTR [esp+8]		;ecx = time_to_waste
	mov edx, -128			;edx contains value of i for array index
_B1_3:
	mov	eax, DWORD PTR time_left[edx+128] ;eax = t = time_left[i]
	cmp eax, ecx			;compare t with time_to_waste to set flags
	mov	ebx, ecx		;ebx = time_to_waste
	cmovle	ebx, eax		;if(t <= time_to_waste) set ebx = t
	sub	eax, ebx		;time_left[i] - (either t or time_to_waste)
	mov	DWORD PTR time_left[edx+128], eax ;write the result to the array
	add	edx, 4				  ;increment to the next array value
	jnz	_B1_3			;keep looping until the entire array is traversed
	pop	ebx			;restore the value of ebx
	ret

Code Example 3.2.3.2 Assembly Code Generated from the Intel C/C++ Compiler Plug-in
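
Besides the cmovle instruction, the listing above starts the index register at -128 and adds 4 until it reaches zero, so the add instruction sets the flags for the jnz and no separate cmp is needed. The following sketch expresses the same loop shape in C++. It is not compiler output: the function name timeloop_negative_index and its parameters are hypothetical, and the index counts elements rather than bytes.

```cpp
#include <cstddef>

// Walks the array with an index that runs from -count up to 0,
// addressing each element relative to the end of the array.
void timeloop_negative_index(int *time_left, std::ptrdiff_t count, int time_to_waste)
{
    int *end = time_left + count;                  // one past the last element
    for (std::ptrdiff_t i = -count; i != 0; ++i) {
        int t = end[i];                            // same element as time_left[i + count]
        int amount = time_to_waste;                // mov ebx, ecx
        if (t <= time_to_waste) {                  // cmovle ebx, eax
            amount = t;
        }
        end[i] = t - amount;                       // sub eax, ebx, then the store
    }
}
```

The result is identical to the original timeloop function; only the direction of the index and the form of the loop test differ.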

3.2.4 Performance Analysis

The performance analysis of the assembly instructions generated by each compiler is provided in Table 3.2.4. The cycle counts were obtained by using the RDTSC and CPUID instructions on a 266 MHz Pentium II processor.

COMPILER USED	NUMBER OF ASSEMBLY LANGUAGE INSTRUCTIONS	TOTAL NUMBER OF CYCLES
Microsoft Visual C++ 5.0 Compiler	15 Instructions	266 Cycles
Intel C/C++ Compiler Plug-in	13 Instructions	247 Cycles

Table 3.2.4. Generation of CMOVcc Performance Comparison

The improvement is only 7% for this simple example (247 cycles instead of 266), but if an application contains a large number of jumps and branches, this optimization could significantly improve overall application performance.
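
The percentage can be reproduced from the cycle counts in Table 3.2.4. The helper below is a small sketch of that arithmetic; the function name percent_improvement is hypothetical, and the calculation assumes that the improvement is expressed relative to the Microsoft Visual C++ cycle count.

```cpp
// Returns the reduction in cycles as a percentage of the baseline count.
double percent_improvement(double baseline_cycles, double optimized_cycles)
{
    return 100.0 * (baseline_cycles - optimized_cycles) / baseline_cycles;
}
```

With the values from Table 3.2.4, percent_improvement(266, 247) gives 19 / 266, or roughly 7%. The same calculation applied to the cycle counts in Table 3.3.4, percent_improvement(135, 25), gives roughly 81%.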

3.3. Optimizing Floating Point to Integer Conversions

The Intel C/C++ Compiler Plug-in provides an optimization switch to improve the performance of floating point to integer conversions. Graphics applications which use floating point data as input into their rendering operations can benefit from this type of optimization. The rendering operations usually take floating point data as inputs and a conversion then needs to be made from floating point to integer. Any speed up in the conversion provides a benefit to the application.

The compiler switch that improves the floating point to integer conversions is the rounding control option, -Qrcd. The switch optimizes the conversion by eliminating the changes in rounding mode that normally take place around each conversion. In the C language, floating point values must be truncated when they are converted to integers. The default rounding mode for the system is round-to-nearest. Therefore, in order to truncate the floating point values, a rounding mode switch must occur. The rounding mode then has to be switched back to the default, round-to-nearest, after the truncation takes place. Switching rounding modes adds overhead to each floating point to integer conversion. By using the rounding control option on the Intel C/C++ Compiler Plug-in, the overhead associated with changing rounding modes is eliminated. The -Qrcd option does not affect the floating point calculations themselves. However, since the rounding mode changes are eliminated, the integer conversions are performed in the current rounding mode and do not conform to the C semantics.
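
The difference between the two conversion behaviors can be seen in standard C++ without any compiler switch. The sketch below is an illustration, not the code the Plug-in generates: it uses the standard library function std::lrint, which converts in the current rounding mode, as a stand-in for a conversion that skips the switch to truncation. The function names are hypothetical, and the comments assume the default round-to-nearest mode.

```cpp
#include <cmath>

// Conversion with C semantics: the fractional part is discarded.
int convert_truncating(float value)
{
    return static_cast<int>(value);
}

// Conversion in the current rounding mode (round-to-nearest by default).
long convert_current_mode(float value)
{
    return std::lrint(value);
}
```

For the value 1.4 used in this application note both functions return 1, so the example program produces the same result either way. For 1.7 the truncating conversion returns 1 while the round-to-nearest conversion returns 2, which is the kind of difference to check for before enabling an option that changes conversion semantics.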

3.3.1 How to Use the Floating Point to Integer Conversion Optimization

To use the floating point rounding control optimization, the -Qrcd switch must be set in the project settings. Select Project from the menu and choose Settings. When the dialog box appears, select the C/C++ tab and type the switch into the project options box. An example is provided below using the -Qrcd setting as a project option.

Code Example 3.3.1 Project Options using the Rounding Control Option

3.3.2. Floating Point Example Code

A simple example shows the benefit of using the -Qrcd rounding control option for floating point conversions. The next sections discuss the assembly code generated by both the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ 5.0 Compiler.

	int a = 5;
	float b = 1.4;
	
	void foo()
	  {
	   a = b;	//floating point to integer conversion
	  }

Code Example 3.3.2. Floating Point to Integer Conversion

3.3.3. Assembly Code Generated

Using the Microsoft Visual C++ Compiler

Code example 3.3.3 describes the assembly code generated by the Visual C++ 5.0 Compiler.

	?a@@3HA	DD	01H DUP (?)	; a
	?b@@3MA	DD	01H DUP (?)	; b
	fld	DWORD PTR ?b@@3MA	; loads the floating point value b
	call	__ftol			; calls function to convert value to integer
	mov	DWORD PTR ?a@@3HA, eax	; sets a= b

Code Example 3.3.3 Assembly Code Generated from the Visual C++ Compiler

Using the Intel C/C++ Compiler Plug-in

Code example 3.3.3.1 describes the assembly code generated by the Intel C/C++ Compiler Plug-in using the -Qrcd optimization option.

	fld DWORD PTR ?b@@3MA 		;loads the floating point value b
	fistp QWORD PTR [esp+8] 	;converts value to an integer
	mov eax, DWORD PTR [esp+8] 	;stores value to eax
	mov DWORD PTR ?a@@3HA, eax 	;sets a = b

Code Example 3.3.3.1 Assembly Code Generated from the Intel C/C++ Compiler Plug-in

3.3.4. Performance Comparison

The performance analysis of the assembly instructions generated by each compiler is provided in Table 3.3.4. The cycle counts were obtained by using the RDTSC and CPUID instructions on a 266 MHz Pentium II processor.

COMPILER USED	NUMBER OF ASSEMBLY LANGUAGE INSTRUCTIONS GENERATED FOR THE INTEGER CONVERSION	TOTAL NUMBER OF CYCLES FOR THE CONVERSION
Microsoft Visual C++ 5.0 Compiler	3 Instructions	135 Cycles
Intel C/C++ Compiler Plug-in	4 Instructions	25 Cycles

Table 3.3.4 Integer Conversion Performance Comparison

The floating point to integer conversion improves by 81% with the Intel C/C++ Compiler Plug-in (25 cycles instead of 135). This could be a substantial improvement for an application in which these conversions occur frequently.

4.0. Additional Resources

Additional information regarding the features of the Intel C/C++ Compiler Plug-in can be found in the "Intel C/C++ Compiler Plug-in User's Guide for Win32 Systems". This document is provided with the compiler. Additional information is also available on the following web site: http://support.intel.com/oem_developer/msl/ic

5.0. Appendix

5.1. CMOVcc Code Listing

#include <stdio.h>
#include <stdlib.h>
#define CPUID _asm _emit 0fh _asm _emit 0a2h
#define RDTSC _asm _emit 0fh _asm _emit 031h
int time_left[32];
int cyc;
int base;
void timeloop(int time_to_waste)
  {
  for(int i=0; i<32; i++)
    {
    int t = time_left[i];
    time_left[i] -= time_to_waste < t ? time_to_waste: t;
    }
  }
void main() 
   {
   //Base is used to calculate the time it takes to execute the CPUID and RDTSC inst.
   base = 0;	
   //Cyc will contain the amount of cycles taken to execute the timeloop function
   cyc = 0;
   _asm			//computes the base time of the RDTSC and CPUID calls	
    {
    CPUID
    RDTSC
    mov cyc,eax
    CPUID
    RDTSC
    sub eax, cyc
    mov base,eax
    }
   cyc = 0;		//initializes the cycles to zero
  _asm
   {
   CPUID		//computes the starting time
   RDTSC
   mov cyc,eax
   }
   timeloop(32);	//calls the timeloop function
  _asm
   {
   CPUID		//computes the ending time and total cycles
   RDTSC
   sub eax, cyc
   mov cyc, eax
   }
			//prints the number of cycles the timeloop function took
   printf("Base: %d\n",base);
   printf("Number of cycles: %d\n",(cyc-base));
  }
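
The listing above keeps only the low 32 bits of the time stamp counter (the value RDTSC returns in EAX) and stores them in signed integers. The following sketch shows the same arithmetic with unsigned 32-bit values, so that the subtraction still gives the right answer if the low half of the counter wraps around between the two readings. The function name net_cycles and its parameters are hypothetical, and the clamp to zero is an added assumption for the case where the measured interval is shorter than the measurement overhead.

```cpp
#include <cstdint>

// start and end are the low 32 bits of two time stamp readings;
// base is the overhead measured for the CPUID and RDTSC pair.
std::uint32_t net_cycles(std::uint32_t start, std::uint32_t end, std::uint32_t base)
{
    std::uint32_t raw = end - start;          // unsigned subtraction wraps modulo 2^32
    return (raw > base) ? raw - base : 0;     // remove the measurement overhead
}
```

This only handles a single wraparound of the low 32 bits; intervals longer than 2^32 cycles would need the upper half of the counter that RDTSC returns in EDX.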

5.2. Floating Point to Integer Conversion Code Listing

#include <stdio.h>
#include <stdlib.h>
#define CPUID _asm _emit 0fh _asm _emit 0a2h
#define RDTSC _asm _emit 0fh _asm _emit 031h
int cyc;
int base;
int a = 5;	//initialize the integer value
float b = 1.4;	//initialize the floating point value
void foo()
  {
  a = b;	//floating point to integer conversion
  }
void main()
{
   //Base is used to calculate the time it takes to execute the CPUID and RDTSC inst.
   base = 0;	
   //Cyc will contain the number of cycles taken to execute the foo function
   cyc = 0;
   _asm			//computes the base time of the RDTSC and CPUID calls	
    {
    CPUID
    RDTSC
    mov cyc,eax
    CPUID
    RDTSC
    sub eax, cyc
    mov base,eax
    }
   cyc = 0;		//initializes the cycles to zero
  _asm
   {
   CPUID		//computes the starting time
   RDTSC
   mov cyc,eax
   }
   foo();		//call the foo function
_asm
   {
   CPUID		//computes the ending time and total cycles
   RDTSC
   sub eax, cyc
   mov cyc, eax
   }
			//prints the number of cycles for the float to integer conversion
   printf("Base: %d\n",base);
   printf("Number of cycles: %d\n",(cyc-base));
  }
